The Linear Model: \[\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}\]
Where \(\mathbf{y}\) is the \(n \times 1\) response vector, \(\boldsymbol{\beta}\) is the coefficient vector, \(\boldsymbol{\varepsilon}\) is the error vector, and the design matrix (with a leading column of ones for the intercept) is:
\[\mathbf{X} = \begin{bmatrix}1 & x_{11} & x_{12} & \cdots & x_{1p} \\1 & x_{21} & x_{22} & \cdots & x_{2p} \\\vdots & \vdots & \vdots & \ddots & \vdots \\1 & x_{n1} & x_{n2} & \cdots & x_{np}\end{bmatrix}\]
Parameters vs. Estimates: \(\boldsymbol{\beta}\) is the unknown population parameter; \(\hat{\boldsymbol{\beta}}\) is the estimate computed from data
Observed vs. Predicted: \(y_i\) is the observed response; \(\hat{y}_i\) is the fitted value produced by the model
Definition: Difference between observed and predicted values
\[\hat\varepsilon_i = y_i - \hat{y}_i\]
In matrix form:
\[\hat{\boldsymbol{\varepsilon}} = \mathbf{y} - \hat{\mathbf{y}}\]
Sum of Squared Errors:
\[\text{SSE} = (\mathbf{y}- \mathbf{X}\boldsymbol\beta)^T(\mathbf{y}- \mathbf{X}\boldsymbol\beta)\]
Take derivative with respect to \(\boldsymbol{\beta}\) and set to zero:
\[\frac{\partial \text{SSE}}{\partial \boldsymbol{\beta}} = -2\mathbf{X}^T\mathbf{y} + 2\mathbf{X}^T\mathbf{X}\boldsymbol{\beta} = \mathbf{0}\]
Rearrange: \[\mathbf{X}^T\mathbf{X}\boldsymbol{\beta} = \mathbf{X}^T\mathbf{y}\]
These are the normal equations
Solve for \(\hat{\boldsymbol{\beta}}\):
\[\hat{\boldsymbol{\beta}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}\]
This is the ordinary least squares estimator
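As a quick numerical sketch (with hypothetical toy data), the normal equations can be solved directly with numpy; solving \(\mathbf{X}^T\mathbf{X}\boldsymbol{\beta} = \mathbf{X}^T\mathbf{y}\) as a linear system avoids forming the explicit inverse:

```python
import numpy as np

# Hypothetical toy data: n = 6 observations, one predictor plus intercept.
rng = np.random.default_rng(0)
n = 6
x = np.arange(n, dtype=float)
X = np.column_stack([np.ones(n), x])   # design matrix: column of ones, then x
beta_true = np.array([1.0, 2.0])
y = X @ beta_true + rng.normal(scale=0.1, size=n)

# Solve the normal equations X'X beta = X'y as a linear system.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
```

With small noise, `beta_hat` lands close to the true \((1, 2)\).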
Definition: \[\mathbf{H} = \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\]
What it does: \[\hat{\mathbf{y}} = \mathbf{H}\mathbf{y}\]
Symmetric: \(\mathbf{H}^T = \mathbf{H}\)
Idempotent: \(\mathbf{H}^2 = \mathbf{H}\)
\(\mathbf{I} - \mathbf{H}\) is also symmetric and idempotent
\[\hat{\boldsymbol{\varepsilon}} = (\mathbf{I} - \mathbf{H})\mathbf{y}\]
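These properties are easy to verify numerically; a minimal check on a hypothetical design matrix:

```python
import numpy as np

# Hypothetical design matrix: intercept plus two predictors, n = 8.
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(8), rng.normal(size=(8, 2))])
H = X @ np.linalg.inv(X.T @ X) @ X.T   # hat matrix

sym = np.allclose(H, H.T)              # H is symmetric
idem = np.allclose(H @ H, H)           # H is idempotent
M = np.eye(8) - H
idem_M = np.allclose(M @ M, M)         # I - H is idempotent too
```

All three checks come out `True` up to floating-point tolerance.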
Problems with the traditional approach: forming \(\mathbf{X}^T\mathbf{X}\) squares the condition number of \(\mathbf{X}\), and explicitly inverting it is numerically unstable when predictors are nearly collinear
QR decomposition provides a more stable solution
Any matrix \(\mathbf{X}\) can be decomposed as:
\[\mathbf{X} = \mathbf{Q}\mathbf{R}\]
Where \(\mathbf{Q}\) is \(n \times p\) with orthonormal columns (\(\mathbf{Q}^T\mathbf{Q} = \mathbf{I}\)) and \(\mathbf{R}\) is \(p \times p\) upper triangular
Traditional: \(\hat{\boldsymbol{\beta}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}\)
With QR: If \(\mathbf{X} = \mathbf{Q}\mathbf{R}\), then:
\[\begin{align} \hat{\boldsymbol{\beta}} &= \left((\mathbf{Q}\mathbf{R})^T\mathbf{Q}\mathbf{R}\right)^{-1}(\mathbf{Q}\mathbf{R})^T\mathbf{y} \\ &= (\mathbf{R}^T\mathbf{R})^{-1}\mathbf{R}^T\mathbf{Q}^T\mathbf{y} \qquad (\mathbf{Q}^T\mathbf{Q} = \mathbf{I}) \\ &= \mathbf{R}^{-1}\mathbf{Q}^T\mathbf{y} \end{align}\]
Solve using back substitution: \(\mathbf{R}\hat{\boldsymbol{\beta}} = \mathbf{Q}^T\mathbf{y}\)
Upper triangular system is easy to solve:
\[\begin{bmatrix} r_{11} & r_{12} & r_{13} \\ 0 & r_{22} & r_{23} \\ 0 & 0 & r_{33} \end{bmatrix} \begin{bmatrix} \hat\beta_1 \\ \hat\beta_2 \\ \hat\beta_3 \end{bmatrix} = \begin{bmatrix} q_1 \\ q_2 \\ q_3 \end{bmatrix}\]
Work backwards: first \(\hat\beta_3 = q_3/r_{33}\), then \(\hat\beta_2 = (q_2 - r_{23}\hat\beta_3)/r_{22}\), then \(\hat\beta_1 = (q_1 - r_{12}\hat\beta_2 - r_{13}\hat\beta_3)/r_{11}\)
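The QR route above can be sketched in numpy (hypothetical data; the hand-rolled loop is the back substitution, included to mirror the derivation):

```python
import numpy as np

# Hypothetical data: intercept plus two predictors.
rng = np.random.default_rng(2)
n = 20
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 0.5, -0.3]) + rng.normal(scale=0.1, size=n)

Q, R = np.linalg.qr(X)   # thin QR: Q is n x p, R is p x p upper triangular
rhs = Q.T @ y

# Back substitution on R beta = Q'y, working from the last row up.
p = R.shape[0]
beta_hat = np.zeros(p)
for j in range(p - 1, -1, -1):
    beta_hat[j] = (rhs[j] - R[j, j + 1:] @ beta_hat[j + 1:]) / R[j, j]

# Agrees with the normal-equations solution.
beta_ne = np.linalg.solve(X.T @ X, X.T @ y)
```

In practice one would call a library triangular solver rather than write the loop, but the result is the same.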
Expected value (linearity): \[E[\mathbf{A}\mathbf{Y} + \mathbf{b}] = \mathbf{A}E[\mathbf{Y}] + \mathbf{b}\]
Variance-covariance transformation: \[\text{Var}(\mathbf{A}\mathbf{Y} + \mathbf{b}) = \mathbf{A}\text{Var}(\mathbf{Y})\mathbf{A}^T\]
Linearity: \(\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}\)
Zero mean errors: \(E[\boldsymbol{\varepsilon}] = \mathbf{0}\)
Homoscedasticity & independence: \(\text{Var}(\boldsymbol{\varepsilon}) = \sigma^2\mathbf{I}\)
Full rank: \(\mathbf{X}\) has full column rank
Theorem: Under the GM assumptions, OLS is BLUE
BLUE = Best Linear Unbiased Estimator
Show: \(E[\hat{\boldsymbol{\beta}}] = \boldsymbol{\beta}\)
Be able to demonstrate why!
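The demonstration is a two-line chain using the GM assumptions (linearity and zero-mean errors):

```latex
\begin{align}
E[\hat{\boldsymbol{\beta}}]
  &= E\!\left[(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}\right]
   = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T E[\mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}] \\
  &= (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{X}\boldsymbol{\beta}
     + (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T \underbrace{E[\boldsymbol{\varepsilon}]}_{=\,\mathbf{0}}
   = \boldsymbol{\beta}
\end{align}
```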
Total Sum of Squares (TSS): \[\text{TSS} = \sum_{i=1}^n (y_i - \bar{y})^2 = \mathbf{y}^T\mathbf{y} - n\bar{y}^2\]
Regression Sum of Squares (SS\(_{\text{Reg}}\)): \[\text{SS}_{\text{Reg}} = \sum_{i=1}^n (\hat{y}_i - \bar{y})^2 = \mathbf{y}^T\mathbf{H}\mathbf{y} - n\bar{y}^2\]
Sum of Squared Errors (SSE): \[\text{SSE} = \sum_{i=1}^n (y_i - \hat{y}_i)^2 = \mathbf{y}^T(\mathbf{I} - \mathbf{H})\mathbf{y}\]
TSS: Total variation in the data
“How spread out are my y-values?”
SS\(_{\text{Reg}}\): Variation explained by the model
“How much variation did our model capture?”
SSE: Variation left unexplained
“How much did we miss with our model?”
Total variation = Explained + Unexplained
\[\text{TSS} = \text{SS}_{\text{Reg}} + \text{SSE}\]
Definition: Proportion of variation explained by the model
\[R^2 = \frac{\text{SS}_{\text{Reg}}}{\text{TSS}} = 1 - \frac{\text{SSE}}{\text{TSS}}\]
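A numerical check of the decomposition and of \(R^2\), on a hypothetical fit (the identity \(\text{TSS} = \text{SS}_{\text{Reg}} + \text{SSE}\) requires the intercept column, which is included here):

```python
import numpy as np

# Hypothetical model: intercept plus two predictors, n = 30.
rng = np.random.default_rng(3)
n = 30
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([2.0, 1.0, -1.0]) + rng.normal(size=n)

H = X @ np.linalg.inv(X.T @ X) @ X.T   # hat matrix
y_hat = H @ y
ybar = y.mean()

TSS = np.sum((y - ybar) ** 2)          # total variation
SS_reg = np.sum((y_hat - ybar) ** 2)   # explained variation
SSE = np.sum((y - y_hat) ** 2)         # unexplained variation
R2 = SS_reg / TSS
```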
True error variance: \[\text{Var}(\varepsilon_i) = \sigma^2\]
Unbiased estimator: \[\hat{\sigma}^2 = \frac{\text{SSE}}{n-p}\]
where \(n-p\) is the degrees of freedom (observations minus parameters)
Also called Mean Square Error (MSE)
Null hypothesis: \(H_0: \beta_j = 0\)
Test statistic: \[t = \frac{\hat{\beta}_j}{\text{se}(\hat{\beta}_j)} \sim t_{n-p}\]
\[\text{se}(\hat{\beta}_j) = \sqrt{\hat{\sigma}^2[(\mathbf{X}^T\mathbf{X})^{-1}]_{jj}}\]
P-value: \(P(|T| \geq |t|)\) where \(T \sim t_{n-p}\)
Recall: \(\text{Var}(\hat{\boldsymbol{\beta}}) = \sigma^2(\mathbf{X}^T\mathbf{X})^{-1}\)
For individual coefficient \(j\): \[\text{Var}(\hat{\beta}_j) = \sigma^2[(\mathbf{X}^T\mathbf{X})^{-1}]_{jj}\]
Estimate by replacing \(\sigma^2\) with \(\hat{\sigma}^2\): \[\widehat{\text{Var}}(\hat{\beta}_j) = \hat{\sigma}^2[(\mathbf{X}^T\mathbf{X})^{-1}]_{jj}\]
Standard error is the square root: \[\text{se}(\hat{\beta}_j) = \sqrt{\widehat{\text{Var}}(\hat{\beta}_j)}\]
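Putting the pieces together on hypothetical data (here `scipy.stats.t` supplies the t-distribution tail probability for the two-sided p-value):

```python
import numpy as np
from scipy import stats

# Hypothetical data: true slopes are (2, 0), so beta_1 should test significant.
rng = np.random.default_rng(4)
n, p = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([1.0, 2.0, 0.0]) + rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
resid = y - X @ beta_hat
sigma2_hat = resid @ resid / (n - p)           # MSE, unbiased for sigma^2
se = np.sqrt(sigma2_hat * np.diag(XtX_inv))    # se(beta_j)
t_stats = beta_hat / se                        # t = beta_hat_j / se(beta_hat_j)
p_values = 2 * stats.t.sf(np.abs(t_stats), df=n - p)   # two-sided p-values
```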
Tests whether the model is useful at all
Null: \(H_0: \beta_1 = \beta_2 = \cdots = \beta_{p-1} = 0\)
(all slopes equal zero, only intercept remains)
Test statistic: \[F = \frac{\text{SS}_{\text{Reg}}/(p-1)}{\text{SSE}/(n-p)} \sim F_{p-1, n-p}\]
\[F = \frac{\text{Mean Square Regression}}{\text{Mean Square Error}}\]
Large F: Model explains a lot relative to noise
Small F: Model doesn’t explain much more than noise
Alternative F-statistic form: \[F = \frac{R^2/(p-1)}{(1-R^2)/(n-p)}\]
This shows the F-test is testing whether \(R^2\) is significantly greater than zero
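The two forms of the F-statistic are algebraically identical, which is easy to confirm numerically on hypothetical data:

```python
import numpy as np

# Hypothetical model with two slopes, n = 40.
rng = np.random.default_rng(5)
n, p = 40, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([1.0, 1.5, -0.5]) + rng.normal(size=n)

H = X @ np.linalg.inv(X.T @ X) @ X.T
y_hat = H @ y
ybar = y.mean()
SS_reg = np.sum((y_hat - ybar) ** 2)
SSE = np.sum((y - y_hat) ** 2)
R2 = SS_reg / (SS_reg + SSE)

F_sums = (SS_reg / (p - 1)) / (SSE / (n - p))          # sums-of-squares form
F_r2 = (R2 / (p - 1)) / ((1 - R2) / (n - p))           # R^2 form
```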
Framework: \(H_0: \mathbf{C}\boldsymbol{\beta} = \mathbf{d}\)
Where \(\mathbf{C}\) is a \(q \times p\) matrix of restrictions (\(q\) = number of restrictions) and \(\mathbf{d}\) is a \(q \times 1\) vector of hypothesized values
Test statistic: \[F = \frac{(\mathbf{C}\hat{\boldsymbol{\beta}} - \mathbf{d})^T[\mathbf{C}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{C}^T]^{-1}(\mathbf{C}\hat{\boldsymbol{\beta}} - \mathbf{d})/q}{\text{SSE}/(n-p)}\]
Under \(H_0\): \(F \sim F_{q, n-p}\)
Single coefficient: \(H_0: \beta_1 = 0\) \[\mathbf{C} = [0, 1, 0, 0], \quad \mathbf{d} = 0\]
Equality of coefficients: \(H_0: \beta_1 = \beta_2\) \[\mathbf{C} = [0, 1, -1, 0], \quad \mathbf{d} = 0\]
Multiple restrictions: \(H_0: \beta_2 = \beta_3 = 0\) \[\mathbf{C} = \begin{bmatrix} 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}, \quad \mathbf{d} = \begin{bmatrix} 0 \\ 0 \end{bmatrix}\]
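The third example (\(H_0: \beta_2 = \beta_3 = 0\), so \(q = 2\)) can be computed directly from the F formula above, on hypothetical data generated with those coefficients actually zero:

```python
import numpy as np

# Hypothetical data: true coefficients (1, 2, 0, 0), so H0 is in fact true.
rng = np.random.default_rng(6)
n, p = 60, 4
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([1.0, 2.0, 0.0, 0.0]) + rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
SSE = np.sum((y - X @ beta_hat) ** 2)

# H0: beta_2 = beta_3 = 0  ->  C beta = d with q = 2 rows.
C = np.array([[0, 0, 1, 0],
              [0, 0, 0, 1]], dtype=float)
d = np.zeros(2)
q = C.shape[0]

diff = C @ beta_hat - d
# F = (C b - d)' [C (X'X)^{-1} C']^{-1} (C b - d) / q  over  SSE / (n - p)
F = (diff @ np.linalg.solve(C @ XtX_inv @ C.T, diff) / q) / (SSE / (n - p))
```

Under \(H_0\) this `F` is one draw from \(F_{2,\,56}\), so a value near 1 is typical.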
| Source | df | Sum of Squares | Mean Square | F |
|---|---|---|---|---|
| Regression | \(p-1\) | SS\(_{\text{Reg}}\) | MS\(_{\text{Reg}}\) | MS\(_{\text{Reg}}\)/MSE |
| Error | \(n-p\) | SSE | MSE | |
| Total | \(n-1\) | TSS | | |
Definition: Probability of observing a test statistic as extreme or more extreme than what we observed, assuming \(H_0\) is true
For t-tests (two-sided): \[\text{p-value} = P(|T| \geq |t|) = 2 \times P(T \geq |t|)\]
For F-tests (one-sided): \[\text{p-value} = P(F_{q,n-p} \geq f)\]